Improving Data Cleaning Quality Using a Data Lineage Facility

نویسندگان

  • Helena Galhardas
  • Daniela Florescu
  • Dennis Shasha
  • Eric Simon
  • Cristian-Augustin Saita
چکیده

The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for some applications, existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge with them is the design of a data flow graph that effectively generates clean data. A generalized difficulty is the lack of explanation of cleaning results and user interaction facilities to tune a data cleaning program. This paper presents a solution to handle this problem by enabling users to express user interactions declaratively and tune data cleaning programs.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

NEW TECHNIQUES FOR IMPROVING BIOLOGICAL DATA QUALITY THROUGH INFORMATION INTEGRATION by

NEW TECHNIQUES FOR IMPROVING BIOLOGICAL DATA QUALITY THROUGH INFORMATION INTEGRATION by Katherine Grace Herbert As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems ...

متن کامل

A revival of integrity constraints for data cleaning

Integrity constraints, a.k.a. data dependencies, are being widely used for improving the quality of schema. Recently constraints have enjoyed a revival for improving the quality of data. The tutorial aims to provide an overview of recent advances in constraint-based data cleaning.

متن کامل

Biological data cleaning: a case study

As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems is non-trivial and can cause many problems for the database. For biological data to be corrected and standardise...

متن کامل

Improving Data Quality in Intelligent Transportation Systems

—Intelligent Transportation Systems (ITS) use data and information technology to improve the operation of our transportation network. ITS contributes to sustainable development by using technology to make the transportation system more efficient; improving our environment by reducing emissions, reducing the need for new construction and improving our daily lives through reduced congestion. A ke...

متن کامل

Data Quality in Data Warehouses

Fayyad and Uthursamy (2002) have stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This article provides an overview of two methods for improving quality. The first is data cleaning for finding duplicates within files or across files. The second is edit/imputation for maintaining business ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2001